class: center, middle, inverse, title-slide # APSTA-GE 2003: Intermediate Quantitative Methods ## Lab Section 003, Week 4 ### New York University ### 09/29/2020 --- ## Reminders - Assignment 2 - Due: **10/02/2020 11:55pm (EST)** - Office hours - Monday 9 - 10am (EST) - Wednesday 12:30 - 1:30pm (EST) - Office hour Zoom link - https://nyu.zoom.us/j/97347070628 (pin: 2003) - Office hour notes - Available on NYU Classes under the "Resources" tab --- ## Fun slide - **"What do you call a regression model with only two data points?"** - A line. --- ## Let's build the fun part **Let's create a simulated data set with only two data points.** ```r first_point <- c(25, 10) second_point <- c(5, 30) dat_for_fun <- data.frame( X = c(first_point[1], second_point[1]), Y = c(first_point[2], second_point[2]) ) dat_for_fun ``` ``` ## X Y ## 1 25 10 ## 2 5 30 ``` --- ## Here comes the fun part **Let's fit a regression model and talk about the perfect R-squared and those NAs for half an hour.** ```r # Let's fit a linear regression model lm_for_fun <- lm(formula = Y ~ X, data = dat_for_fun) summary(lm_for_fun) ``` ``` ## ## Call: ## lm(formula = Y ~ X, data = dat_for_fun) ## ## Residuals: ## ALL 2 residuals are 0: no residual degrees of freedom! ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 35 NA NA NA ## X -1 NA NA NA ## ## Residual standard error: NaN on 0 degrees of freedom ## Multiple R-squared: 1, Adjusted R-squared: NaN ## F-statistic: NaN on 1 and 0 DF, p-value: NA ``` --- ## More fun? Yes! **To make things even better, let's plot the results!**
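The plot itself is not included in this export. A minimal base-R sketch of what plotting the two-point fit could look like (re-creating the data here so the chunk runs on its own):

```r
# Re-create the two-point data and model from the previous slides
dat_for_fun <- data.frame(X = c(25, 5), Y = c(10, 30))
lm_for_fun <- lm(Y ~ X, data = dat_for_fun)

# With n = 2, the fitted line passes through both points exactly
plot(Y ~ X, data = dat_for_fun, pch = 19, xlim = c(0, 30), ylim = c(0, 35))
abline(lm_for_fun, col = "red", lwd = 2)
```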
--- ## How about this? **How about four data points?**
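The next slide fits `lm2_for_fun` on a data set called `dat2_for_fun`, whose construction is not shown in this excerpt. A minimal sketch using the four points listed there (row order chosen to match the residual numbering in the `summary()` output):

```r
# Hypothetical reconstruction of dat2_for_fun (not shown in the slides):
# the points (25, 10), (5, 30), (30, 20), (10, 15)
dat2_for_fun <- data.frame(
  X = c(25, 5, 30, 10),
  Y = c(10, 30, 20, 15)
)
dat2_for_fun
```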
--- ## Let's check the result *(25, 10), (10, 15), (5, 30), (30, 20)* ```r lm2_for_fun <- lm(formula = Y ~ X, data = dat2_for_fun) summary(lm2_for_fun) ``` ``` ## ## Call: ## lm(formula = Y ~ X, data = dat2_for_fun) ## ## Residuals: ## 1 2 3 4 ## -5.882 6.471 6.029 -6.618 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 25.4412 8.7181 2.918 0.100 ## X -0.3824 0.4293 -0.891 0.467 ## ## Residual standard error: 8.849 on 2 degrees of freedom ## Multiple R-squared: 0.284, Adjusted R-squared: -0.07395 ## F-statistic: 0.7934 on 1 and 2 DF, p-value: 0.4671 ``` --- ## Seamlessly switching to today's topics **Distance to the mean**: the distance between each observed outcome `\(Y_i\)` and the mean of the dependent variable (DV), `\(\overline{Y}\)`.
--- ## Sum of Squares (SS) What is **Sum of Squares**? The sum of squared distances from the mean. `$$\sum_{i=1}^n (Y_i - \overline{Y})^2$$` --- ## Three types of SS **Total Sum of Squares (TSS)**: the sum of squared distances between the observed values and the mean `$$TSS = \sum_{i=1}^n (Y_i - \overline{Y})^2$$` **Model (Explained) Sum of Squares (MSS)**: the sum of squared distances between the fitted values and the mean, i.e., the variability the model explains `$$MSS = \sum_{i=1}^n (\hat{Y_i} - \overline{Y})^2$$` **Residual Sum of Squares (RSS)**: the sum of squared distances between the observed values and the fitted values, i.e., the variability the model leaves unexplained `$$RSS = \sum_{i=1}^n e_{i}^2 = \sum_{i=1}^n (Y_i - \hat{Y_i})^2$$` --- ## Total Sum of Squares (TSS) `$$TSS = \sum_{i=1}^n (Y_i - \overline{Y})^2$$`
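TSS is easy to compute directly; a self-contained sketch (re-creating `dat2_for_fun` from the earlier slides):

```r
dat2_for_fun <- data.frame(X = c(25, 5, 30, 10), Y = c(10, 30, 20, 15))

# TSS: squared distances of each observed Y from the mean of Y
TSS <- sum((dat2_for_fun$Y - mean(dat2_for_fun$Y))^2)
TSS # 218.75, matching the hand computation on the next slide
```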
--- ## Total Sum of Squares (TSS) cont. `$$\begin{align} TSS_{fun} & = \sum_{i=1}^4 (Y_i - \overline{Y})^2 \\ & = (10 - 18.75)^2 + (15 - 18.75)^2 + (30 - 18.75)^2 + (20 - 18.75)^2 \\ & = 218.75 \end{align}$$` --- ## Model Sum of Squares (MSS) `$$MSS = \sum_{i=1}^n (\hat{Y_i} - \overline{Y})^2$$`
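MSS can likewise be computed from the fitted values; a self-contained sketch:

```r
dat2_for_fun <- data.frame(X = c(25, 5, 30, 10), Y = c(10, 30, 20, 15))
lm2_for_fun <- lm(Y ~ X, data = dat2_for_fun)

# MSS: squared distances of the fitted values from the mean of Y
MSS <- sum((fitted(lm2_for_fun) - mean(dat2_for_fun$Y))^2)
MSS # about 62.13235
```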
--- ## Model Sum of Squares (MSS) cont. `$$\begin{align} MSS_{fun} & = \sum_{i=1}^4 (\hat{Y_i} - \overline{Y})^2 \\ & = (15.88235 - 18.75)^2 + (23.52941 - 18.75)^2 + (13.97059 - 18.75)^2 + (21.61765 - 18.75)^2 \\ & = 62.1323529412 \end{align}$$` --- ## Residual Sum of Squares (RSS) `$$RSS = \sum_{i=1}^n e_{i}^2 = \sum_{i=1}^n (Y_i - \hat{Y_i})^2$$`
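RSS is the sum of the squared residuals, which `lm()` already stores for us; a self-contained sketch:

```r
dat2_for_fun <- data.frame(X = c(25, 5, 30, 10), Y = c(10, 30, 20, 15))
lm2_for_fun <- lm(Y ~ X, data = dat2_for_fun)

# RSS: sum of the squared residuals
RSS <- sum(resid(lm2_for_fun)^2)
RSS # about 156.61765 at full precision
```

At full precision RSS is 156.6176..., slightly different from the hand computation on the next slide, which uses fitted values rounded to five decimals.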
--- ## Residual Sum of Squares (RSS) cont. `$$\begin{align} RSS_{fun} & = \sum_{i=1}^4 (Y_i - \hat{Y_i})^2 \\ & = (10 - 15.88235)^2 + (30 - 23.52941)^2 + (20 - 13.97059)^2 + (15 - 21.61765)^2 \\ & = 156.617732353 \end{align}$$` --- ## TSS, MSS, and RSS In our case: `$$TSS_{fun} = 218.75 \\ MSS_{fun} = 62.1323529412 \\ RSS_{fun} = 156.617732353$$` `$$TSS_{fun} = 218.75\\ MSS_{fun} + RSS_{fun} = 62.1323529412 + 156.617732353 = 218.750085294 \\ TSS_{fun} = MSS_{fun} + RSS_{fun}$$` The tiny discrepancy (218.750085294 vs. 218.75) is rounding error from using fitted values rounded to five decimals; at full precision the identity holds exactly. --- ## Analysis of Variance ```r anova(lm2_for_fun) ``` ``` ## Analysis of Variance Table ## ## Response: Y ## Df Sum Sq Mean Sq F value Pr(>F) ## X 1 62.132 62.132 0.7934 0.4671 ## Residuals 2 156.618 78.309 ``` `$$F_{fun} = \frac{MSS_{fun}/(p-1)}{RSS_{fun}/(n-p)} = \frac{62.132/(2-1)}{156.618/(4 - 2)} = 0.793$$` where `\(p\)` is the number of estimated parameters (here `\(p = 2\)`: intercept and slope) and `\(n\)` is the number of observations. --- ## R Squared `\(R^2\)` is the proportion of the variability in the outcome that the model explains. `$$R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS}$$` In our case: `$$\begin{align} R_{fun}^2 & = \frac{MSS_{fun}}{TSS_{fun}} = 1 - \frac{RSS_{fun}}{TSS_{fun}} \\ & = \frac{62.132}{218.75} = 1 - \frac{156.618}{218.75} \\ & = 0.284 \end{align}$$`
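The identity `\(TSS = MSS + RSS\)` and the `\(R^2\)` computation can be verified at full precision in R; a self-contained check:

```r
dat2_for_fun <- data.frame(X = c(25, 5, 30, 10), Y = c(10, 30, 20, 15))
lm2_for_fun <- lm(Y ~ X, data = dat2_for_fun)

ybar <- mean(dat2_for_fun$Y)
TSS <- sum((dat2_for_fun$Y - ybar)^2)
MSS <- sum((fitted(lm2_for_fun) - ybar)^2)
RSS <- sum(resid(lm2_for_fun)^2)

all.equal(TSS, MSS + RSS)      # TRUE: the identity holds exactly
MSS / TSS                      # R-squared, about 0.284
summary(lm2_for_fun)$r.squared # the same value reported by summary()
```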